[RLlib] New ConnectorV2 API #03: Introduce actual ConnectorV2 API. (#41074) #41212

Conversation

@sven1977 (Contributor) commented Nov 16, 2023

READY FOR INITIAL REVIEW AND FEEDBACK. TEST CASES PENDING (TODO).

Introduce new ConnectorV2 API:

The new ConnectorV2 API will replace the existing Connector API and introduce the following enhancements and changes:

  • Keep as-is: A "connector" remains a callable, pluggable piece within a "connector pipeline" (which itself is a connector).
  • Keep as-is: A "connector pipeline" can be assembled and disassembled by adding, inserting, removing connector pieces at/from arbitrary locations.
  • New design: There are now 3 sub-types of connector pipelines to increase explicitness:
    • Env-to-Module pipeline: Used in EnvRunners to transform environment outputs (i.e. SingleAgentEpisodes) into RLModule's forward_exploration|inference() batches to compute the next action.
    • Module-to-Env pipeline: Used in EnvRunners to transform RLModule forward_exploration|inference() outputs into action(s) that can be sent to a gym.Env.
    • Learner pipeline: Used inside a Learner actor to transform incoming sample batches and/or episode objects (from EnvRunners or buffers) into the final train batch used for the Learner.update() call.
  • New design: RLlib will automatically provide a single default connector piece at the end of each of the above 3 pipelines to always ensure that a) at least the most recent observation is available in the batch (under the SampleBatch.OBS key) and b) if the RLModule is stateful, the most recent RLModule STATE_OUT output is fed back as the next STATE_IN in the resulting RLModule input batch (at the beginning of an episode, STATE_IN is instead the RLModule's initial state, stacked/repeated to the batch size). For the Learner pipeline, RLlib's default connector piece also makes sure all data has the proper time rank added.
  • Changes in design: When calling a "connector", we pass as call args the previous connector's (within the same pipeline) outputs as well as all currently sampled Episode objects. This way, any connector has access to all data that already got stored previously in the ongoing episode, for example previous rewards/actions, recent observations or - in a multi-agent setting - the observations made by agents other than ourselves. This unlocks a range of new usecases that were previously not supported.
  • Changes in design: When calling a "connector", it has access to a) the current RLModule, b) the gym.vector.Env, c) the current explore (True|False) setting, d) and any data possibly passed by previous connectors (even from another pipeline). For example, an env->module connector might want to pick the particular single-agent RLModule to be used for a given agent ID and then let the following module->env connector pipeline know, how it picked, such that it can properly convert back from module output to actions. This way, we can eventually replace the policy_mapping_fn functionality currently hardcoded into RolloutWorker by a ConnectorV2.
  • New design: Users can now more easily configure their custom connectors, something that was previously only possible through complex callbacks and by extracting the Policy object from deep inside the algo. Instead, callables can now be defined and added to the AlgorithmConfig. These callables take a gym.Env or a set of observation/action spaces and return the respective connector pipeline (used on the EnvRunners and the Learners, respectively); see the configuration sketch right after this list.

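A minimal sketch of what such a configuration callable could look like, based on the config.env_to_module_connector attribute and the FrameStackingEnvToModule piece referenced in the review comments further below; treat the exact attribute name and signatures as assumptions rather than the final API:

# Sketch only: names are taken from this PR's review comments and may change.
from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().environment("ALE/Pong-v5")

# The user provides a callable that receives the (vector) env and returns the
# env-to-module connector piece (or pipeline) to use. RLlib then appends its
# default piece at the end of the pipeline automatically.
# (FrameStackingEnvToModule is the frame-stacking connector added in this PR;
# its import is omitted here.)
config.env_to_module_connector = lambda env: FrameStackingEnvToModule(
    env=env, num_frames=4
)
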
To list some of the advantages that this new design will offer our users:

  • Many old stack APIs will simply be replaced by connectors: Filters, Preprocessors, Policy.postprocess_trajectory, trajectory view API, tons of hard-coded logic on action clipping, reward clipping, Atari frame stacking, RNN-time-rank handling and zero padding, etc... A dummy example showing how frame-stacking can be achieved with connectors is part of this PR.
  • The clear separation of EnvRunner connectors (env-to-module and module-to-env) on one side and Learner connectors on the other will allow us to NOT worry anymore, within an RLModule's forward_exploration|inference() implementation, about what the algo's training step might or might not need. For example, in PPO, we can now offload the vf predictions entirely onto the learner side and perform vf computations (including bootstrap value computations at the truncation edges of episodes) in a batched and distributed (multi-GPU) fashion.
  • With these new connectors' help, the EnvRunner main loop simplifies to:
while ts < num_timesteps:
  # Connector from env to RLModule.
  to_module = self.env_to_module(episodes=self._episodes, explore=..., rl_module=...)

  # RLModule forward calls.
  if explore:
    mod_out = self.module.forward_exploration(to_module)
  else:
    mod_out = self.module.forward_inference(to_module)

  # Connector from RLModule back to env.
  to_env = self.module_to_env(
    input=mod_out, episodes=self._episodes, explore=..., rl_module=...
  )

  # Native gym.vector.Env `step()` call.
  obs, rewards, terminateds, truncateds, infos = (
    self.env.step(to_env[SampleBatch.ACTIONS])
  )
  ts += self.num_envs

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 changed the title from "[RLlib] Preparatory PR: Make EnvRunners use (enhanced) Connector API (#03: introduce ConnectorV2 API)" to "[RLlib] New ConnectorV2 API #03: Introduce actual ConnectorV2 API. (#41074)" on Nov 17, 2023
@@ -0,0 +1,116 @@
from functools import partial
sven1977 (Contributor Author) commented:

Built-in framestacking connector to use for Atari.

Optionally, move this into the examples folder, as RLlib does not use it automatically (the user has to explicitly configure this connector via config.env_to_module_connector = lambda env: FrameStackingEnvToModule(env=env, num_frames=4)).

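To make the idea concrete, below is a deliberately naive sketch of what such a frame-stacking env-to-module piece does conceptually, using the __call__ signature discussed later in this review (rl_module, data, episodes, explore, shared_data). The actual connector in this PR additionally handles observation spaces, zero-padding at episode starts, and other batching details that are omitted here:

import numpy as np

class NaiveFrameStack:
    """Illustrative sketch only; not the FrameStackingEnvToModule from this PR."""

    def __init__(self, num_frames: int = 4):
        self.num_frames = num_frames

    def __call__(self, *, rl_module, data, episodes, explore=None, shared_data=None, **kwargs):
        stacked = []
        for episode in episodes:
            # Assumes the episode exposes its observations as a sequence; episodes
            # shorter than `num_frames` would need zero-padding (not shown).
            frames = list(episode.observations)[-self.num_frames:]
            stacked.append(np.stack(frames, axis=-1))
        # Place the stacked observations into the batch for the RLModule forward pass.
        data["obs"] = np.stack(stacked, axis=0)
        return data
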
@@ -0,0 +1,134 @@
from functools import partial
sven1977 (Contributor Author) commented:

Built-in prev action/reward connector.

Optionally, move this into the examples folder, as RLlib does not use it automatically (the user has to explicitly configure this connector via config.env_to_module_connector = lambda env: PrevRewardPrevActionEnvToModule(env=env, n_prev_actions=1, n_prev_rewards=10)).

@@ -0,0 +1,75 @@
from enum import Enum
sven1977 (Contributor Author) commented:

Note: Once we write the default multi-agent env-to-module and module-to-env connector logic, we'll have to see whether we even need these input/output types. I'm not so sure anymore. Maybe a simple input space -> output space mapping (as it already exists) will be sufficient, with the input space being a Dict mapping agent IDs to individual agent spaces and the output space being another Dict mapping module IDs to individual spaces.

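For illustration, the input-space -> output-space idea from the comment above could look roughly like this (agent and module IDs are made up):

import gymnasium as gym
import numpy as np

# Input space: maps agent IDs to their individual observation spaces.
input_space = gym.spaces.Dict({
    "agent_0": gym.spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32),
    "agent_1": gym.spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32),
})

# Output space: maps module IDs to the spaces their (batched) inputs conform to.
output_space = gym.spaces.Dict({
    "shared_module": gym.spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32),
})
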
@kouroshHakha (Contributor) left a comment:

I am sort of stamping here, @sven1977; unfortunately I don't have the time to review this big PR. I just noticed there are some variable-name and documentation inconsistencies early on in the review. Please take another pass on them. The base class of connectors looks good and is in line with what we discussed. Hopefully, the next example PRs are going to be smaller and I can see more concretely how these tie into each other. Thanks.

Args:
num_frames: The number of observation frames to stack up (into a single
observation) for the RLModule's forward pass.
as_preprocessor: Whether this connector should simply postprocess the
Contributor (reviewer):

This is not defined in the signature?

self,
*,
rl_module: RLModule,
input_: Any,
Contributor (reviewer):

Call it `data` instead of `input_`?

sven1977 (Contributor Author):

Ok, will change.

sven1977 (Contributor Author):

done

explore: Whether `explore` is currently on. Per convention, if True, the
RLModule's `forward_exploration` method should be called, if False, the
EnvRunner should call `forward_inference` instead.
persistent_data: Optional additional context data that needs to be exchanged
Contributor (reviewer):

Call it something else? Maybe `shared_data`?

sven1977 (Contributor Author):

Ok, will change.

sven1977 (Contributor Author):

done

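Putting the two renames from these threads together (input_ -> data, persistent_data -> shared_data), the resulting connector __call__ would presumably look roughly as follows; type hints and defaults are assumptions, not the merged code:

from typing import Any, Dict, List, Optional

class MyConnector:  # In the PR, this would subclass the new ConnectorV2 base class.
    def __call__(
        self,
        *,
        rl_module: Any,                                 # the current RLModule
        data: Any,                                      # output of the previous piece
        episodes: List[Any],                            # currently sampled episodes
        explore: Optional[bool] = None,                 # exploration vs. inference
        shared_data: Optional[Dict[str, Any]] = None,   # context shared across pieces
        **kwargs,
    ) -> Any:
        # A pass-through piece simply returns the batch unchanged.
        return data
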
@sven1977 merged commit bd555a0 into ray-project:master on Dec 21, 2023 (9 checks passed).
@sven1977 deleted the env_runner_support_connectors_03_connectorv2_api branch on December 21, 2023, 13:51.